Traffic Stops: A Descriptive SQL Analysis

DS002R Final Presentation

William Walz

Project Overview

Goal: To analyze traffic stop patterns across three U.S. jurisdictions using the Stanford Open Policing Project (SOPP).

Scope

  • New York State: Racial distribution trends (2010–2018)
  • California: Reporting differences (Long Beach vs. San Bernardino)
  • Florida: Search and frisk outcomes by race

Methodology

  • Data wrangling: 100% SQL (aggregating, CASE WHEN, UNION, filtering).
  • Visualization: R and ggplot2 for final rendering.
  • Analysis: Descriptive only (administrative/reporting patterns, not causation).

Trend Analysis: New York State

Key Findings

  • White Drivers: Represent the majority (70-78%), but share is declining
  • Minority Trends: Slight upward trend in recorded stops for Black and Hispanic drivers
  • Context: Data reflects recorded stops, not population demographics
SELECT
  subject_race AS race,
  COUNT(*) AS n_stops,
  YEAR(date) AS year
FROM ny_statewide_2020_04_01
WHERE subject_race IS NOT NULL
GROUP BY race, year
HAVING year BETWEEN 2010 AND 2018;
ny_stops_clean <- ny_stops |>
  mutate(across(where(bit64::is.integer64), as.integer))

ny_stops_pct <- ny_stops_clean |>
  group_by(year) |>
  mutate(
    total_stops_year = sum(n_stops),
    pct = n_stops / total_stops_year
  ) |>
  ungroup()

ggplot(ny_stops_pct, aes(x = year, y = pct, color = race)) +
  geom_line(linewidth = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = percent_format(accuracy = 1)) +
  labs(
    title = "NY Traffic Stops by Race (2010-2018)",
    x = "Year",
    y = "Percent of Stops",
    color = "Race"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "bottom")

Line chart of NY traffic stops by race over time.

Agency Comparison: California

Reporting differences

  • Both agencies: Traffic stops are dominated as vehicular
  • San Bernardino: Much more heavily concentrated in vehicular stops only
  • Long Beach: Records a noticeable number of pedestrian stops

Data note: The “NA/Unknown” category likely mixes bikes, scooters, or other missing labels and should be treated as a reporting consistency rather than a real stop type

(SELECT
  type AS stop_type,
  COUNT(*) AS type_count,
  'Long Beach' AS city
FROM ca_long_beach_2020_04_01
GROUP BY stop_type)

UNION

(SELECT
  type AS stop_type,
  COUNT(*) AS type_count,
  'San Bernardino' AS city
FROM ca_san_bernardino_2020_04_01
GROUP BY stop_type);
combined_types <- ca_stops_combined |>
  mutate(across(where(bit64::is.integer64), as.integer)) |>
  mutate(
    stop_type = ifelse(is.na(stop_type) | stop_type == "", "NA/Unknown", stop_type)
  ) |>
  group_by(city) |>
  mutate(stop_type = reorder(stop_type, type_count)) |>
  ungroup()

ggplot(combined_types, aes(x = stop_type, y = type_count)) +
  geom_col(fill = "#4C72B0") +
  coord_flip() +
  scale_y_continuous(labels = comma) +
  facet_wrap(~ city, nrow = 1, scales = "free_x") +
  labs(
    title = "Stop Types: Long Beach vs. San Bernardino",
    x = "Stop Type",
    y = "Count"
    ) +
  theme_minimal(base_size = 16)

Bar charts comparing stop types in Long Beach vs San Bernardino

Post-Stop Outcomes: Florida

Search vs. frisk

  • Rarity: Searches are uncommon and frisks are even rarer (frisk rates generally well below 0.01% of stops).
  • Recorded disparities: In this dataset, Black drivers have the highest recorded search and frisk rates, followed by Hispanic and White drivers
  • Data note: Ambiguous categories (“NA,” “unknown,” “other”) are collapsed into “unknown/other” to avoid over-interpreting miscoded or missing data
SELECT
  CASE
    WHEN subject_race IS NULL OR subject_race IN ('other', 'unknown', '') THEN 'unknown/other'
    ELSE subject_race
  END AS race,
  COUNT(*) AS total_stops,
  SUM(CASE WHEN search_conducted = 1 THEN 1 ELSE 0 END) AS num_searches,
  SUM(CASE WHEN frisk_performed = 1 THEN 1 ELSE 0 END) AS num_frisk,
  1.0 * SUM(CASE WHEN search_conducted = 1 THEN 1 ELSE 0 END) / COUNT(*) AS search_rate,
  1.0 * SUM(CASE WHEN frisk_performed = 1 THEN 1 ELSE 0 END) / COUNT(*) AS frisk_rate
FROM fl_statewide_2020_04_01
GROUP BY race;
fl_search_long <- fl_search |>
  mutate(across(where(bit64::is.integer64), as.integer)) |>
  select(race, search_rate, frisk_rate) |>
  mutate(race = factor(race)) |>
  pivot_longer(
    cols = c(search_rate, frisk_rate),
    names_to = "metric",
    values_to = "rate"
  ) |>
  mutate(
    metric = recode(
      metric,
      "search_rate" = "Search Rate",
      "frisk_rate"  = "Frisk Rate"
    )
  )

ggplot(fl_search_long,
       aes(x = race, y = rate, fill = race)) +
  geom_col() +
  facet_wrap(~ metric, ncol = 2, scales = "free_x") +
  coord_flip() +
  scale_y_continuous(labels = percent_format(accuracy = 0.01)) +
  labs(
    title = "Search & Frisk Rates (Florida)",
    x = "Race",
    y = "Rate"
  ) +
  theme_minimal(base_size = 16) +
  theme(legend.position = "none")

Faceted bar chart of search and frisk rates in Florida

Conclusion

  • Data Structure: Administrative datasets reveal as much about reporting (e.g., Long Beach vs. San Bernardino) as they do about policing
  • Descriptive Trends: While disparities do exist in search/frisk rates and stop composition, these are descriptive correlations
  • For Future Work: True causal analysis would necessitate population baselines (census data) for each year and exposure rates (traffic patterns)

Data Sources

Pierson, E., Simoiu, C., Overgoor, J., Corbett-Davies, S., Jenson, D., Shoemaker, A., Ramachandran, V., et al. (2020). “A Large-Scale Analysis of Racial Disparities in Police Stops Across the United States.” Nature Human Behaviour, 1–10.

Stanford Open Policing Project (SOPP) Data. Traffic stop data compiled by the Stanford Open Policing Project and accessed through the Pomona College SQL server. Original dataset and documentation available at: https://openpolicing.stanford.edu. Pierson et al. (2020).